@jschmidt-icinga
Contributor

@jschmidt-icinga jschmidt-icinga commented Oct 21, 2025

Bug Description 🐞

When multiple state changes happen outside a TimePeriod, NotificationComponent::FireSuppressedNotifications() clears suppressed notifications whose reason no longer applies too early, which leaves the set of suppressed notifications in an inconsistent state.

for (auto type : {NotificationProblem, NotificationRecovery, NotificationFlappingStart, NotificationFlappingEnd}) {
    if ((suppressedTypes & type) && !checkable->NotificationReasonApplies(type)) {
        subtract |= type;
        suppressedTypes &= ~type;
    }
}

As a result, suppressed notifications can already be gone by the time the TimePeriod resumes, so they are never sent.
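To make the failure mode concrete, here is a minimal, self-contained sketch of the loop above running on every timer tick. It is not the Icinga 2 code: FakeCheckable, PruneOnEveryTick, ReplayOutsidePeriod, the flag values and the simplified NotificationReasonApplies() rules are all assumptions for illustration, and the sequence it replays is a hypothetical one rather than the exact reproduction from the issue.

```cpp
#include <cassert>
#include <cstdint>
#include <initializer_list>

// Illustrative flag values; the real enum lives in Icinga 2 and may differ.
enum NotificationType : std::uint32_t {
    NotificationProblem       = 1u << 0,
    NotificationRecovery      = 1u << 1,
    NotificationFlappingStart = 1u << 2,
    NotificationFlappingEnd   = 1u << 3
};

// Hypothetical stand-in for a checkable; the rules below are a simplification.
struct FakeCheckable {
    bool isOk = true;
    bool isFlapping = false;

    bool NotificationReasonApplies(NotificationType type) const {
        switch (type) {
            case NotificationProblem:       return !isOk;
            case NotificationRecovery:      return isOk;
            case NotificationFlappingStart: return isFlapping;
            case NotificationFlappingEnd:   return !isFlapping;
        }
        return false;
    }
};

// Pre-fix behavior: every timer tick prunes, even outside the TimePeriod.
std::uint32_t PruneOnEveryTick(std::uint32_t suppressedTypes, const FakeCheckable& checkable)
{
    for (auto type : {NotificationProblem, NotificationRecovery,
                      NotificationFlappingStart, NotificationFlappingEnd}) {
        if ((suppressedTypes & type) && !checkable.NotificationReasonApplies(type)) {
            suppressedTypes &= ~type;
        }
    }
    return suppressedTypes;
}

// Replays a hypothetical sequence that happens entirely outside the TimePeriod.
std::uint32_t ReplayOutsidePeriod()
{
    FakeCheckable checkable;
    std::uint32_t suppressed = NotificationRecovery; // a recovery was suppressed

    checkable.isOk = false;                          // brief non-OK state
    suppressed = PruneOnEveryTick(suppressed, checkable);
    assert(suppressed == 0u);                        // the recovery is already gone

    checkable.isOk = true;                           // OK again before the period resumes
    return PruneOnEveryTick(suppressed, checkable);  // nothing left to send
}
```

In this sketch the checkable is OK again when the TimePeriod resumes, yet the suppressed recovery was pruned during the brief non-OK state and cannot be restored.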

Fix Description 🔧

The trivial fix is to move the NotificationReasonApplies() check inside the tp->IsInside() condition. Inapplicable notifications are then removed only after the TimePeriod resumes, and the list of suppressed notifications stays intact until no further notifications will be added to it.
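A rough sketch of that idea, under stated assumptions: TickWithFix, TickResult, ReasonApplies, ReplayWithFix and the flag values are hypothetical names for illustration, only problem and recovery are modeled, and clearing the whole set on the first tick inside the period is a simplification of this sketch rather than a claim about the actual change.

```cpp
#include <cassert>
#include <cstdint>
#include <initializer_list>

// Illustrative flag values; only problem and recovery are modeled here.
enum NotificationType : std::uint32_t {
    NotificationProblem  = 1u << 0,
    NotificationRecovery = 1u << 1
};

// Simplified assumption: a problem applies while not OK, a recovery while OK.
bool ReasonApplies(NotificationType type, bool isOk)
{
    return type == NotificationProblem ? !isOk : isOk;
}

struct TickResult {
    std::uint32_t toSend;          // types to notify about on this tick
    std::uint32_t stillSuppressed; // types kept for a later tick
};

// Post-fix idea: prune (and send) only once the TimePeriod is active again.
TickResult TickWithFix(std::uint32_t suppressedTypes, bool insidePeriod, bool isOk)
{
    if (!insidePeriod) {
        return {0u, suppressedTypes}; // outside the period: leave the set untouched
    }

    std::uint32_t toSend = 0u;
    for (auto type : {NotificationProblem, NotificationRecovery}) {
        if ((suppressedTypes & type) && ReasonApplies(type, isOk)) {
            toSend |= type; // still matches the latest state, so it is sent
        }
    }
    return {toSend, 0u}; // inapplicable types are dropped together with the sent ones
}

// Hypothetical sequence: recovery suppressed, brief non-OK state outside the
// period, OK again when the period resumes.
std::uint32_t ReplayWithFix()
{
    std::uint32_t suppressed = NotificationRecovery;

    TickResult outside = TickWithFix(suppressed, /* insidePeriod */ false, /* isOk */ false);
    assert(outside.stillSuppressed == NotificationRecovery); // survives the non-OK state

    TickResult resumed = TickWithFix(outside.stillSuppressed, /* insidePeriod */ true, /* isOk */ true);
    return resumed.toSend; // NotificationRecovery
}
```

The design choice this illustrates is that the applicability check is deferred to the moment notifications are actually sent, so intermediate states outside the period cannot erase pending entries.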

I'm not entirely certain what the purpose of that check is, but the only scenario I could think of is when a TimePeriod begins and a check result produces a notification before NotificationTimerHandler can run and send out the suppressed notifications. I've added a unit-test to ensure that behavior is the same as before.

@Al2Klimov, since you wrote the original code, maybe you can point out other scenarios that this check affects and that could stop working with this PR. In that case we can add further test cases and look for a more complex fix.

Unit-Tests ✅

I've added tests for some common scenarios, including the buggy behavior from the issue this PR fixes. This way the behavior of the NotificationComponent can be verified before and after this PR. I'm happy to add further test cases for behavior that might be affected by this issue and the fix.

Fixes #10575

@cla-bot cla-bot bot added the cla/signed label Oct 21, 2025
@jschmidt-icinga jschmidt-icinga force-pushed the clear-suppr-notif-after-tp-resume branch from a6ba476 to 20560af Compare October 21, 2025 13:19
Al2Klimov
Al2Klimov previously approved these changes Oct 21, 2025

if ((suppressedTypes & type) && !checkable->NotificationReasonApplies(type)) {
    subtract |= type;
    suppressedTypes &= ~type;
Member

Didn't read the issue, but the fix shouldn't hurt. After all, this is also the code branch where notifications are sent.

@jschmidt-icinga jschmidt-icinga force-pushed the clear-suppr-notif-after-tp-resume branch 3 times, most recently from 1fd58ee to 22d7a0d Compare October 22, 2025 08:23
@jschmidt-icinga jschmidt-icinga force-pushed the clear-suppr-notif-after-tp-resume branch 2 times, most recently from c08038d to fea4554 Compare October 23, 2025 09:09
@jschmidt-icinga jschmidt-icinga marked this pull request as ready for review October 23, 2025 09:12
@jschmidt-icinga jschmidt-icinga force-pushed the clear-suppr-notif-after-tp-resume branch from fea4554 to 9fa3acc Compare October 28, 2025 07:44
@jschmidt-icinga jschmidt-icinga modified the milestones: 2.16.0, 2.14.8 Oct 28, 2025
@jschmidt-icinga jschmidt-icinga force-pushed the clear-suppr-notif-after-tp-resume branch from 9fa3acc to 41b5383 Compare October 28, 2025 14:32
Member

@yhabteab yhabteab left a comment

> I'm not entirely certain what the purpose of that check is, but the only scenario I could think of is when a TimePeriod begins and a check result produces a notification before NotificationTimerHandler can run and send out the suppressed notifications.

That's pretty much the reasoning behind it. Its purpose is to avoid sending notifications that are no longer applicable. For instance, suppose a problem notification was suppressed because it occurred outside the time period, and the checkable recovers right after the time period starts but before the NotificationTimerHandler runs. The recovery does not clear the suppressed problem notification, so sending it out at that point would be incorrect.
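The scenario described here can be sketched as a small filter that is applied when the timer first runs inside the time period. This is only an illustration under simplifying assumptions: kProblem, kRecovery, ReasonApplies and FilterAtResume are hypothetical names, and the applicability rules are reduced to "problem while not OK, recovery while OK".

```cpp
#include <cstdint>

// Illustrative flag values; only the two types from the example are modeled.
constexpr std::uint32_t kProblem  = 1u << 0;
constexpr std::uint32_t kRecovery = 1u << 1;

// Simplified assumption: a problem applies while not OK, a recovery while OK.
constexpr bool ReasonApplies(std::uint32_t type, bool isOk)
{
    return type == kProblem ? !isOk : (type == kRecovery ? isOk : false);
}

// Keeps only the suppressed types that still match the current state.
constexpr std::uint32_t FilterAtResume(std::uint32_t suppressed, bool isOk)
{
    return ((suppressed & kProblem)  && ReasonApplies(kProblem, isOk)  ? kProblem  : 0u)
         | ((suppressed & kRecovery) && ReasonApplies(kRecovery, isOk) ? kRecovery : 0u);
}

// Problem suppressed outside the period, recovered before the timer ran: drop it.
static_assert(FilterAtResume(kProblem, /* isOk */ true) == 0u, "stale problem is dropped");
// Still not OK when the timer runs: the suppressed problem is sent.
static_assert(FilterAtResume(kProblem, /* isOk */ false) == kProblem, "problem is kept");
```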

@jschmidt-icinga jschmidt-icinga force-pushed the clear-suppr-notif-after-tp-resume branch 5 times, most recently from 1bb78fa to 393a267 Compare October 29, 2025 08:55
@jschmidt-icinga jschmidt-icinga force-pushed the clear-suppr-notif-after-tp-resume branch from 393a267 to 2d32c6b Compare November 3, 2025 13:59
This includes a few common scenarios and a reproduction of the current behavior
affected by the underlying bug of issue #10575. This is done both to document
the change in behavior and to ensure that the behavior of the other scenarios
stays the same before and after the fix is applied.

Without this commit, every time the NotificationTimerHandler runs it
discards suppressed notifications whose reason no longer applies to the
latest check result. This is probably intended to clear outdated suppressed
notifications immediately when the TimePeriod resumes, but it also clears
out important ones (see the test case).

This commit fixes that by clearing out inapplicable notifications only when
the timer runs for the first time after the TimePeriod resumes. By that time
we can expect that no new suppressed notifications will be added, and all
notifications that don't conflict with the last check result can still be
sent.

Fixes #10575
@jschmidt-icinga jschmidt-icinga force-pushed the clear-suppr-notif-after-tp-resume branch from 2d32c6b to 75c7d28 Compare November 3, 2025 14:39
@yhabteab yhabteab added bug Something isn't working area/notifications Notification events labels Nov 3, 2025
@yhabteab yhabteab enabled auto-merge November 3, 2025 14:50
@yhabteab yhabteab merged commit 35fdea8 into master Nov 3, 2025
25 checks passed
@yhabteab yhabteab deleted the clear-suppr-notif-after-tp-resume branch November 3, 2025 16:52
Labels

area/notifications Notification events bug Something isn't working cla/signed

Development

Successfully merging this pull request may close these issues.

Recovery notification outside time period still lost

4 participants